6.3 - Challenge - Improving the model

For this challenge you will be asked to apply what you have learned to create an RNN model that can generate new sequences of text based on what it has learned from a large body of existing text. In this case we will be using the full text of Lewis Carroll's Alice in Wonderland. Your task for the assignment is to:

  • format the book text into a set of training data (one possible approach is sketched after this list)
  • define an RNN model in Keras based on one or more LSTM or GRU layers
  • train the model with the training data
  • use the trained model to generate new text
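
As a starting point, here is a minimal sketch of one way to carry out the first step: cutting the cleaned text into overlapping fixed-length sequences and one-hot encoding them. It assumes the raw_text and n_chars variables defined in the loading cell below, and the names seq_length, char_to_int, int_to_char, X and y are just one convention (the provided generation code does expect X, char_to_int and int_to_char to exist, however you choose to build them).

# minimal sketch, not the required solution
seq_length = 100   # length of each input sequence
step = 3           # stride between sequences; a smaller step yields more training examples

chars = sorted(list(set(raw_text)))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}
n_vocab = len(chars)

# each training example is seq_length characters; the target is the character that follows
sequences, next_chars = [], []
for i in range(0, n_chars - seq_length, step):
    sequences.append(raw_text[i:i + seq_length])
    next_chars.append(raw_text[i + seq_length])

# one-hot encode the inputs and targets
X = np.zeros((len(sequences), seq_length, n_vocab), dtype=np.bool_)
y = np.zeros((len(sequences), n_vocab), dtype=np.bool_)
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, char_to_int[char]] = 1
    y[i, char_to_int[next_chars[i]]] = 1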

Our previous model based on Obama's essay was prone to overfitting since there was not much data to learn from. As a result, the generated text was either unintelligible (not enough learning) or an exact replica of the training data (overfitting). In this case we are working with a much bigger data set, which should provide enough data to avoid overfitting, but will also take more time to train. To improve your model, you can experiment with tuning the following hyperparameters:

  • Use more than one recurrent layer and/or add more memory units (hidden neurons) to each layer. This will allow the model to learn more complex structures in the data (one possible stacked configuration is sketched after this list).
  • Use sequences longer than 100 characters, which will allow the model to learn from patterns that reach further back in the text.
  • Change the way the sequences are generated. For example, you could split the text into real sentences at the periods, and then either truncate or pad each sentence to make it 100 characters long.
  • Increase the number of training epochs, which will give the model more time to learn. Monitor the validation loss after each epoch to make sure the model is still improving and is not overfitting the training data.
  • Add more dropout to the recurrent layers to minimize overfitting.
  • Tune the batch size: try a batch size of 1 as a (very slow) baseline, then work up to larger sizes from there.
  • Experiment with the scale factor (temperature) used when sampling from the prediction probabilities.
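
To make the first and fifth suggestions concrete, here is one possible (not prescribed) way to stack two LSTM layers with dropout in Keras, assuming X and y have been prepared as one-hot arrays as in the earlier sketch. The layer sizes, dropout rates, epoch count and batch size shown here are placeholders to tune, not recommended values.

# a sketch of a stacked LSTM model; all numbers are starting points to tune
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# save the best weights so a long training run is not lost,
# and monitor the validation loss to watch for overfitting
checkpoint = ModelCheckpoint("weights-best.hdf5", monitor='val_loss',
                             save_best_only=True, mode='min')
model.fit(X, y, epochs=20, batch_size=128,
          validation_split=0.1, callbacks=[checkpoint])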

If you get an error such as an alloc error or an out-of-memory error during training, it means that your computer does not have enough RAM to store the model parameters or the batch of training data needed during a training step. If you run into this issue, try reducing the complexity of your model (the number of layers and/or the number of units per layer) or the mini-batch size.

The last three code blocks will use your trained model to generate a sequence of text based on a predefined seed. Do not change any of that code, but run it before submitting your assignment. Your work will be evaluated based on the quality of the generated text. A good result should be legible, with decent grammar and spelling (this indicates a high level of learning), but the exact text should not appear anywhere in the original book (an exact match would indicate overfitting).

Let's start by importing the libraries we will be using and loading the full text of Alice in Wonderland:


In [ ]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

from time import gmtime, strftime
import os
import re
import pickle
import random
import sys

In [ ]:
filename = "data/wonderland.txt"
raw_text = open(filename).read()

raw_text = re.sub('[^\nA-Za-z0-9 ,.:;?!-]+', '', raw_text)
raw_text = raw_text.lower()

n_chars = len(raw_text)
print("length of text:", n_chars)
print("text preview:", raw_text[:500])

In [ ]:
# write your code here

Do not change the code below, but run it before submitting your assignment to generate the results.


In [ ]:
def sample(preds, temperature=1.0):
    # rescale the predicted probabilities by the temperature:
    # low temperatures sharpen the distribution (safer, more repetitive choices),
    # high temperatures flatten it (more surprising, more error-prone choices)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw a single character index from the rescaled distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
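
To get a feel for what the temperature (diversity) parameter does, you can call sample on a small made-up probability vector; this snippet is purely illustrative and is not part of the graded code. Lower temperatures concentrate the draws on the most likely index, while higher temperatures spread them out.

# illustrative only: count how often each index is drawn at different temperatures
toy_preds = np.array([0.6, 0.3, 0.1])
for temp in [0.2, 1.0, 1.5]:
    draws = [sample(toy_preds, temp) for _ in range(1000)]
    counts = [draws.count(i) for i in range(len(toy_preds))]
    print("temperature", temp, "->", counts)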

In [ ]:
def generate(sentence, sample_length=50, diversity=0.35):
    # relies on X, char_to_int, int_to_char and model defined by your code above
    generated = sentence
    sys.stdout.write(generated)

    for i in range(sample_length):
        # one-hot encode the current seed sentence
        x = np.zeros((1, X.shape[1], X.shape[2]))
        for t, char in enumerate(sentence):
            x[0, t, char_to_int[char]] = 1.

        # predict the next character and sample an index from the distribution
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = int_to_char[next_index]

        # append the new character and slide the seed window forward by one
        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

In [ ]:
prediction_length = 500
seed = "this time alice waited patiently until it chose to speak again. in a minute or two the caterpillar t"

generate(seed, prediction_length, .50)